Although the RC technique solved the problem of output-flow determinism, there was
no broad coverage of other current research in the very specialized field of real-time data- 
flow machines. 

This thesis reviews Revolving Cylinder analysis and then surveys, compares, and 
evaluates research in the field using the review as a baseline for comparison. The RC 
approach is best at improving the throughput and output flow determinism of a narrow 
range of applications on a particular architecture. Each of the other approaches offers
improvements over RC scheduling either in performance, as measured by throughput, or
in flexibility in the applications handled. For each of these improvements, however,
significant trade-offs are made and so improvements become relative when they affect 
system robustness and an ability to handle repeated execution of application graphs. The 
AN/UYS-2 can implement RC scheduling with a minimum of cost and no hardware
reconfiguration, and this makes it the best approach for short-term system improvement.

I. INTRODUCTION


Today’s military battlefield is one of ever increasing lethality. Tomorrow’s combatants 
must have the ability to respond to threats within milliseconds to ensure their survival. 
These narrowing reaction windows necessitate both accurate and timely responses. Real- 
time processors ensure that these responses are performed within a known, guaranteed 
bound or deadline. This allows the designer of an application to use that bound with 
confidence that the system will return a result swiftly and reliably. Examples of real-time 
systems currently in use are those in aircraft cockpits, weapon sensors, and navigation 
systems. All of these handle increasingly complex tasks at high data rates and must do so 
without failure. 

The robustness of these systems is vital because of the tremendous penalty for failure.
Most real-time systems are embedded in some larger system and must have a high degree 
of fault tolerance to ensure the survivability of the platform [Levine 91]. Many of today’s 
real-time systems have multiprocessor based architectures which increase throughput by 
sharing workloads. This facilitates graceful degradation in the event of failure by having
multiple instances of each resource amongst which to spread a load.


A. Digital Signal Processing 

Digital Signal Processing (DSP) is one of the applications standing to benefit by a 
departure from von Neumann style architectures. It is widespread and of particular use to 
the military on platforms ranging from submarines to spacecraft. DSP applications are well 
suited for description using Large Grain Data-flow Graphs (LGDF) because they can be 
described using a combination of mathematical expressions and block diagrams. The data-flow
paradigm preserves the integrity of the flow of data and as a result allows the natural
exploitation of any concurrency in the graph [Lee 87].


B. The AN/UYS-2 

In the 1980s the Navy realized the potential of data-flow architectures and developed
the AN/UYS family of DSPs. The AN/UYS-2, the system with which we are most
concerned, was developed in order to introduce a standard DSP for military land, sea, and
air applications. It is a variable-configuration multiprocessor based on the use of Standard
Electronic Modules or SEMs. There are two different SEMs available: Type B and Type
E. The Type E modules perform the same functions as those of Type B but they are smaller,
lighter, and more power efficient. They were developed for aircraft use because of the
limited space and lift available in an airframe.

The modules are built from off-the-shelf hardware and are used to construct the
processor’s Functional Elements (FE) [Rice 90].


Figure 1: AN/UYS-2 Architecture [Little 91].


1. System Architecture 
The system’s modular design is based on six different functional elements. 
These are the Scheduler (SCH), the Arithmetic Processor (AP), the Global Memories 
(GM), the Input/Output Processor (IOP), the Command Program Processor (CPP), and the 
Input Signal Conditioner (ISC). Each of these performs specific functions in the 
architecture and they are all connected by two buses, the Control Bus (CBUS) and the Data
Transfer Network (DTN) (Figure 1).


2. The System’s Use of Data-Flow 
Because DSP algorithms involve minimal decision making they are ideally
suited for a data-flow machine. We avoid the inherent penalties involved in multiple
branching and can minimize communication overheads if we choose granularity correctly. 
The data-flow paradigm and its implementation in the AN/UYS-2 are reviewed in the 
following chapter and a detailed description of the machine is in [Little 91] and [Bell 92]. 
By mapping nodes of the LGDF graph to processors as their data becomes available we 
naturally schedule and then execute the algorithm. Nodes are mapped to Arithmetic 
Processors and edges correspond to the data flows on the DTN and the FE CBUS. Currently 
the system uses a First Come First Served (FCFS) algorithm to schedule nodes on 
processors. This approach takes advantage of the inherent strengths of multiple processing
by attempting to schedule nodes to any available processor.


3. The Revolving Cylinder Technique 
In real-time DSP the two most desired properties are predictability and 
throughput performance [Little 91]. Unfortunately, the inherent non-determinism of the 
data flows in an LGDF graph can be exacerbated by an arbitrary policy of resource conflict
resolution and thus degrade the predictability of output. 
The research efforts of Zaky and Shukla [Shukla 92] of the Naval Postgraduate
School seek to improve the efficiency of resource allocation in the AN/UYS-2 and thus
effect a reduction in the unpredictability of the DSP’s output arrival. The resulting
scheduling technique is called the “Revolving Cylinder”. The key idea of the technique is
that it inserts synchronization arcs in the LGDF graph in order to improve throughput. It 
restructures the graph by performing a compile-time analysis of each application execution 
profile. Each node in the graph is scheduled to run at its earliest possible start time. If that 
is not possible due to dependencies then it is delayed until the dependency is satisfied. The 
restructured graph is then mapped to a specific number of AP’s to determine whether it 
satisfies the required data rate. This technique ensures maximum processor usage by only
giving resources to those nodes capable of executing at that time [Little 91].

C. OBJECTIVES AND ORGANIZATION


1. Objectives 

The purpose of this thesis is to review current research in the field of scheduling
real-time applications on data flow architectures and then attempt to find possible 
improvements to the Revolving Cylinder. The thesis distills the salient features of the 
Revolving Cylinder technique and establishes a framework of comparison. This becomes 
a benchmark against which to compare the methodologies of other real-time scheduling 
research. Current techniques are reviewed and then compared to the Revolving Cylinder 
with emphasis on the differences, strengths, and weaknesses of each when viewed in the
context of the framework.


2. Organization 
Chapter II consists of a brief review of Digital Signal Processing and the data- 
flow paradigm. This familiarizes the reader with the task of the AN/UYS-2 and the reasons 
a data-flow architecture is so uniquely suited to the task. Chapter III covers the Revolving 
Cylinder in depth and establishes the primary features of the technique in order to establish 
a reference framework for comparison with other techniques. Chapter IV covers current
research efforts in real-time multiprocessor scheduling and compares them using the
framework of the RC as a reference. Chapter V is the conclusion in which
recommendations are made and in which future research possibilities are covered.


II. BACKGROUND


A. DATA-FLOW IMPLEMENTATION OF DSP 

The military has a number of applications which use digital signal processing 
techniques in their implementations. These include radar and sonar systems, image 
processing, speech recognition, etc. Each is of vital strategic concern to the nation given 
our increased requirements for sensors, control, and intelligence information. The Naval 
Postgraduate School is working on improving DSP performance for systems operating in
real-time environments such as the AN/UYS-2.


1. DSP Performance 

The current and future performance needs of DSP applications require ever- 
increasing throughput capacities. This necessitates the use of cutting-edge and extremely 
expensive hardware. Yet as hardware technologies improve we approach the physical 
limitations of single processor architectures. A processor capable of 1 billion operations per 
second requires a 1 nanosecond clock period. At this point we start to see the limitations 
imposed by the speed of light because a signal can only move 20cm in silicon during such 
a short interval. This causes huge design problems in terms of skewed clock signals, size 
limitations, and performance degradation [Meng 91]. 

An attractive alternative to increasing single processor performance is the use of 
multiple processors concurrently working on a single task. A multiprocessor’s potential
to divide a job up and perform it faster means higher throughput with less expensive 
hardware. 

The first hurdle, however, is that sequential programming languages fail to fully 
exploit concurrency because the programmer spends a great deal of time countering the 
basic design of the language by using special instructions designed to spawn parallelism. 
Development and debugging are difficult because of the contradiction between language 
structure and programming task. Languages and applications whose properties promote
parallelism are thus easier to implement.


Data flow introduces the notion of values applied to functions rather than 
instructions fetching the contents of memory cells as in conventional control flow [Gaudiot 
87]. Conventional von Neumann machines declare an instruction ready when a program
counter points to it. This event is usually under the direct control of the programmer. A
control flow program is a sequential listing of instructions whereas a data flow program
is best represented as a graph in which nodes are instructions which communicate with
other nodes using the edges of the graph as illustrated in Figure 2.


Node 1 must have tokens on both of its input arcs before it can
fire. Similarly, node 2 must have the result of node 1 and data coming
in on its other arc. This ensures that node 2 waits for node 1 even
though they fire asynchronously.

Figure 2: An illustration of a data-flow computation [Gaudiot 87].


Signal processing algorithms are appropriate for description by functional 
languages and are often represented by mathematical expressions and a graph form (see 
Figure 3 below). Using graphical representations of an application allows the programmer
to utilize an intuitively obvious representation of a task. A DSP graph is best implemented
by a vector operation (i.e., a loop in which all iterations present no dependencies among
themselves) which easily delivers parallelism by compiler analysis or programmer
inspection. It usually consists of simple constructs such as arithmetic instructions, FFT
butterfly networks, simple filters, and so on.


A graphical representation of a second order digital filter. The forks
replicate each input sample on all output paths. The “D” on two of the
arcs indicates delay and the “1”s adjacent to each node indicate that a
single token is produced or consumed on that edge when the node fires
[Lee 87].

Figure 3: Example of a DSP application’s graphical representation [Lee 87].

DSP expressions readily translate into data-flow graphs. An instruction is 
declared executable when it has all its operands (see Figure 2). We can see the utility of a 
paradigm which encapsulates nodes so naturally. In the graphical representation above this 
means that all the input arcs to a node must carry data values (referred to as tokens) before 
the node is executed. Execution proceeds by first absorbing the input tokens, processing the
input values according to the instructions of the node, and accordingly producing
result tokens on the output arcs [Gaudiot 87]. The graphical representations of a DSP
application are highly similar to those of a data-flow algorithm and as such map naturally
to an architecture using this paradigm.


The graphical description of a digital filter (Figure 3) is a directed, acyclic graph
and could be implemented on a data-flow machine. The nodes represent large grain
computations which can be selected from a library of signal processing functions.


2. Data Flow Implementation of DSP 

Practical implementations of a data-flow approach require some mechanism for 
both the management of data flows and the capture of the built-in scheduling and 
synchronization properties of the graph. These mechanisms typically operate at run-time 
and result in overheads that lead to sub-optimal performance. The amount of overhead 
depends upon the granularity of the graph and on the amount of recursion or branching 
present. Research, in fact, shows that a hardware implementation of the data-flow paradigm 
for general applications results in unmanageable overheads [Shukla 92]. 

Our problem lies in finding tasks that can use current multiprocessor technology
to increase throughput. DSP naturally yields a great deal of useful parallelism
because we know, a priori, the amount of data produced and consumed during execution
and that there is negligible use of decision making or branching in the application.

Data-flow graphs describe the dependencies between the different functional 
nodes of an application. They also provide intrinsic scheduling and synchronization 
because the executability of an instruction is decided by local criteria only and the
presence of the operands is sensed locally by each instruction. This is an attractive property 
for an implementation running in a distributed environment. 

If we choose the granularity of the nodes correctly then the effect of each 
operation is limited to the production of results consumed by a specific number of other 
nodes. This precludes the existence of side-effects which may affect the state of a cell of
memory used only much later by some other unrelated operation. Granularity has the added
benefit of keeping interprocessor communications to a minimum. The generality of this
representation allows us to specify parallelism from the instruction level all the way up to
the task level.


B. Data-Flow Implementation of DSP on the AN/UYS-2
Applications are specified as data-flow graphs with nodes representing large grain 
computations chosen from a library of signal processing functions. The edges of a graph 
represent queues which receive data from the source node and supply data to the destination 
node. Each queue is allocated a memory module for storage which maintains its current size
and remaining capacity. 
As data arrives on all the input queues of a node, the threshold values (the minimum 
number of data items that must be present in a queue for its destination to become ready) 
associated with each queue are eventually exceeded. A node is ready for execution when
two conditions are satisfied:
(1) all incoming queues exceed their thresholds, and
(2) all output queues are under their capacity values.

All memory modules communicate the events of threshold/capacity crossing to the 
scheduler which determines if a node is ready. Initially all processors are on the Free
Processor List (FPL) and the scheduler assigns them nodes as they are placed on the Ready 
Node List (RNL). 
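
The two readiness conditions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Queue` record and the `node_is_ready` function are hypothetical names, not part of the AN/UYS-2 software, and the sketch assumes that a queue holding exactly its threshold count satisfies condition (1) and that an output queue must hold strictly fewer items than its capacity to satisfy condition (2).

```python
from dataclasses import dataclass


@dataclass
class Queue:
    """Hypothetical stand-in for one graph edge stored in a memory module."""
    size: int       # data items currently held
    threshold: int  # minimum items needed before the destination may fire
    capacity: int   # maximum items the queue can hold


def node_is_ready(inputs, outputs):
    """Apply the two readiness conditions to one node.

    Assumption: size >= threshold counts as meeting the threshold, and
    size < capacity counts as being under capacity.
    """
    inputs_ok = all(q.size >= q.threshold for q in inputs)
    outputs_ok = all(q.size < q.capacity for q in outputs)
    return inputs_ok and outputs_ok


# One input queue that has reached its threshold.
full_input = Queue(size=4, threshold=4, capacity=8)
# An output queue with room left, and one that is already full.
free_output = Queue(size=2, threshold=1, capacity=8)
blocked_output = Queue(size=8, threshold=1, capacity=8)

ready = node_is_ready([full_input], [free_output])
stalled = node_is_ready([full_input], [blocked_output])
```

In this sketch the second call reports a stalled node: its data has arrived but its output queue has no remaining capacity, which is the situation the scheduler must detect before placing a node on the RNL.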


1. Setup, Execution, and Breakdown 
When a node is assigned to a processor it fetches the data and the instruction 
stream corresponding to the node from the appropriate memory module. When the entire 
instruction stream and queue data are fetched the setup of the node is complete. Each 
processor communicates this event to the scheduler to get itself placed on the FPL so that 
the next node may start setup. Thus, the node already setup begins execution while the next 
node on the RNL begins setup. This occurs under the restriction that a processor may have
only one node set up and pending to execute at any time. The data generated by an
execution is first stored locally. Upon completion, a processor transfers the data to the
appropriate memory module storing the output queues in what is referred to as the
breakdown phase.

Every node goes through three phases at a processor: Setup, Execution, and 
Breakdown. Since their functions are independent and the set-up/breakdown operations 
may require time comparable to the execution time, these operations could be overlapped 
by providing independent functional units for data movement and execution in the
processor.


2. Performance Degradation 

Upon arrival of sufficient data at nodes which only receive input from the
outside world, an instance of the graph is started and its execution proceeds according to
the data-flow principle. As a result of the data-flow execution, which corresponds to 
asynchronous task-level pipelining, several instances of the graph are active 
simultaneously. 

Aside from the requirement that the required throughput must be met by the 
machine, real-time performance may require that all instances of the graph should complete 
in the same amount of time. Between the completion of the setup of a node at a processor 
and the actual start of its execution, there may be a delay because the execution unit at a 
processor has not completed the previous node. This delay is in addition to the delay a ready 
node may experience waiting on the RNL. Both delays result in an increase in the latency 
of the graph execution. 

On the other hand, an execution unit may have to wait for the setup completion
of the next node assigned to it after it completes its current node. If this happens, execution 
cycles are lost and the machine’s throughput degrades. 

To maximize throughput all execution units must run continuously so each
processor must have a node set up for execution at the time it finishes the previous node’s
computation. Because the scheduler is a simple run-time dispatcher that matches RNL
nodes to free processors, the delays described above depend upon the application’s
execution profile. This profile depends upon the data rate, the spatial and temporal 
parallelism in the graph, the number of processors in the system, the number of memory 
modules, and the allocation of queues to memory modules. 

Since task-level parallelism is being considered, performance can be improved 
significantly if setup and breakdown cost can be minimized. One method to reduce this cost 
is to chain successive nodes together and execute them on a single processor one after the 
other. This results in saving the breakdown cost for the first node and setup cost for the next
node.
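
The saving from chaining can be illustrated with simple arithmetic. The sketch below is an assumption-laden example, not a model of the real machine: the phase times are invented, the `Node` record and both cost functions are hypothetical, and the phases are assumed to run back to back with no overlap.

```python
from dataclasses import dataclass


@dataclass
class Node:
    setup: int      # cycles to fetch instructions and queue data
    execute: int    # cycles of computation
    breakdown: int  # cycles to write results back to the output queues


def separate_cost(nodes):
    """Total cycles when every node pays its own setup and breakdown."""
    return sum(n.setup + n.execute + n.breakdown for n in nodes)


def chained_cost(nodes):
    """Total cycles when the nodes run as one chain on one processor.

    Only the first setup and the last breakdown are paid; intermediate
    results are assumed to stay in the processor's local storage.
    """
    return nodes[0].setup + sum(n.execute for n in nodes) + nodes[-1].breakdown


# Hypothetical phase times for two successive nodes.
chain = [Node(setup=2, execute=5, breakdown=2), Node(setup=2, execute=4, breakdown=2)]
saving = separate_cost(chain) - chained_cost(chain)
```

With these invented numbers the saving equals the breakdown cost of the first node plus the setup cost of the second, which is the quantity identified in the paragraph above.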


C. Unpredictability in Program Behavior 

In real-time environments the ability to predict a program’s performance is critical for
efficient allocation of resources such as memory modules, processors, and queue sizes. The 
AN/UYS-2’s use of the First-Come-First-Served (FCFS) paradigm for assignment of 
processors to ready nodes degrades its performance in two ways: irregular execution
patterns and interference/contention in the memory modules. 

When data arrives periodically, unpredictable execution patterns arise due to the
absence of direct control over execution of nodes that depend only upon the receipt of data 
from the external world. If the output queue capacities for these nodes are unlimited they 
execute at a rate that matches the input arrival rate and are independent of the rate at which 
other nodes execute. In the presence of finite queues, they execute at the input rate until the 
output queues are filled and then stall until nodes down the graph create space in the queues 
by consuming data from the output queues. This leads to the individual graph instances not 
being executed uniformly. This is undesirable in real-time environments because it leads to 
non-deterministic output rates and thus cannot guarantee that minimum performance
bounds will remain inviolate.

A simple data-flow graph. Letters label nodes, arcs are tokens, and
numbers are execution times.

Figure 4: A sample input graph for the AN/UYS-2 [Little 91].


Figure 4 is an illustration of a simple data-flow graph and Table 1 is a possible schedule 
of execution for that graph. The table shows how the schedule might run in an environment 
in which the inputs from the outside world readied an “A” node for the RNL on every cycle. 
Without any additional scheduling management the RNL swiftly fills with the second and 
third instances of the graph before the system has a chance to fully execute the first 
instance. FCFS guarantees that the first instance of a graph will finish before the next but 
it cannot provide anything close to deterministic output as it approaches heavy loads.
Machine throughput can degrade because the memory access patterns may be such that
there is contention at the memory modules while setting up and breaking down nodes.


TABLE 1: A possible execution of the graph of Figure 4 under FCFS 





In the following section, a framework is presented that introduces synchronization 
dependencies in the graph based on the technique of revolving cylinder analysis. This 
technique addresses the problem illustrated above by inserting extra dependencies in the 
graph and then enforcing them at run-time. In this way we avoid much of the overhead of 
run-time scheduling management by using the execution profile of the graph to do the work
for us.

III. REVOLVING CYLINDER ANALYSIS


Revolving Cylinder Analysis generates a new data-flow graph as a result of compile- 
time analysis [Shukla 92]. This provides built-in run-time support for the system scheduler. 
The Revolving Cylinder restructures the application, described as a task-level data-flow 
graph, by mapping it on the surface of a hypothetical cylinder whose dimensions are 
determined by both the number of system processors and the sum of node execution times. 

The technique results in increased predictability in simulations of typical DSP 
applications [Shukla 91]. It differs from other research in that it uses the application profile 
of the graph to reduce the scheduling overheads that make data-flow so difficult to 
implement. The essential features of RC analysis are outlined in this chapter in order to
establish a framework of comparison with current research in this field.


A. An Introduction to RC Analysis 

The key to RC analysis is that the insertion of dependencies in the application graph 
will result in both increased throughput performance and more deterministic output rates. 
These added dependencies change the point at which a node will enter the Ready Node List 
(RNL) based on whether or not its predecessors higher in the LGDF graph are complete and 
whether previous iterations of the graph are complete. The actual scheduling of a node to a
processor is left to the scheduler (SCH) at run-time. The goal is to allow scheduling to
remain dynamic and thus keep overheads low. 

The Revolving Cylinder automatically determines whether an application can meet 
real-time requirements during graph compilation. Having done so it then restructures the
graph so that it will have more deterministic throughput and output arrival rates. This
ensures that each instance of a node completes without the creation of an execution backlog
in the lower nodes as discussed in Chapter II.

Given the simple application graph in Figure 5, RC analysis determines whether it can
be mapped to a set number of processors while still satisfying a required data rate. For
reasons of brevity the costs of setup and breakdown for each node are ignored.


A simple data-flow graph. Letters label nodes, arcs are tokens,
and numbers are execution times.

Figure 5: Reference data-flow graph [Little 91]


It can be proved that, as long as communications overheads are ignored, the optimum
throughput for an application is one graph instance per period, where the period is the sum
of node execution times divided by the number of available processors. As an example, a
system with 2 processors executing the graph of Figure 5 has a period of (12/2 = 6) cycles.
The optimum result is that the system could start a new instance of the graph every 6 cycles
as long as it avoids the scheduling pitfalls akin to those of FCFS discussed in the previous
chapter [Little 91].
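
The bound above is a single division, sketched here in Python. The individual node execution times are assumptions: the text only states that the six nodes of the example total 12 cycles, so the per-node values below are invented to match that total, and `optimum_period` is a hypothetical helper name.

```python
from fractions import Fraction


def optimum_period(exec_times, num_procs):
    """Best-case cycles between graph starts, ignoring communication overhead."""
    return Fraction(sum(exec_times.values()), num_procs)


# Hypothetical per-node times; only their sum (12 cycles) comes from the text.
exec_times = {"A": 1, "B": 2, "C": 3, "D": 2, "E": 3, "F": 1}

period = optimum_period(exec_times, num_procs=2)
```

A `Fraction` is used so that a workload that does not divide evenly across the processors is reported exactly rather than rounded.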


B. Insertion of Delays 

The idea of delays in the execution graph provides a stepping stone to the concepts of 
the revolving cylinder. If we insert artificial delays into the graph we can overlap the 
execution of subsequent instances of a node because the delays force the graph to execute 
uniformly despite the fact that some nodes may have their data available before others. 
Using the simple application graph of the previous example as a starting point we insert the 
delays required to ensure that an instance of the graph can be executed and overlapped
every six cycles. The altered graph is shown in Figure 6.


A simple data-flow graph with delays inserted.

Figure 6: The graph of Figure 5 with additional delays [Little 91].


The delays seem counterintuitive for improved performance until we realize that they 
facilitate the control and execution of multiple instances of an application [Shukla 92]. They
help control the execution of the graph by forcing the system to wait on execution of a node 
until the nodes higher in the graph are begun. Table 2 depicts the schedule table of one 
instance of the application of Figure 5, with delays, executing on the system. By inspection 
of the schedule we see that another instance can be started every six cycles because the
delays keep the execution of the graph free of the latencies found in the FCFS algorithm.


TABLE 2: A template for the execution of the graph of Figure 6 

TABLE 3: Execution profile of the RC schedule for Figure 6 at any point after start-up

With the exception of the first 6 cycles of the schedule, which represent a transient,
every subsequent group of six consecutive cycles could be summarized by the schedule in
Table 3. With this paradigm we are almost at the heart of the Revolving Cylinder but for
one important difference. The artificial insertion of delays works well as a run-time
scheduling mechanism but it is difficult to implement during compile-time analysis. We
want a simple technique which will take advantage of the inherent scheduling of the graph
at compile-time so as to keep run-time overheads low.


C. Implementation of The Revolving Cylinder 

RC scheduling recommends when a graph node is scheduled at compile-time (i.e.,
statically) but choosing the AP to schedule it on is left to the run-time dispatcher. This
enables execution scheduling to remain dynamic. The reason for implementing the
algorithm as a cylinder is that data arrives periodically and so the application is invoked
cyclically [Little 92].


1. Mapping Nodes to The Cylinder 


The idea is to schedule the graph such that it wraps around the cylinder and its
end meets its beginning. Let us assume that there is a cylinder whose circumference is the
intended execution length of the schedule in Table 3 (6 nodes with a total of 12 cycles to
be executed in the example) and whose height is the number of processors (2). The table
can be wrapped around the cylinder such that its beginning meets its end. The line on the
surface of the cylinder that separates the end from the beginning has the effect of a
divide-by-C counter where C is the circumference of the cylinder. The counter is incremented
every time the line is crossed to enter the beginning from the end. Hence we get the counter
(i), which allows us to keep track of which nodes belong to any particular graph in
execution [Little 91].
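
The divide-by-C counter can be pictured as integer division of an absolute cycle count by the circumference. The sketch below assumes cycles are numbered from zero and that the counter starts at zero; `cylinder_position` is a hypothetical helper name used only for illustration.

```python
def cylinder_position(cycle, circumference):
    """Map an absolute cycle count to (counter i, slot on the cylinder).

    The quotient is the number of times the dividing line has been
    crossed; the remainder is the slot around the circumference.
    """
    return divmod(cycle, circumference)


C = 6  # circumference of the example cylinder, in cycles
positions = [cylinder_position(t, C) for t in (0, 5, 6, 11, 12)]
```

Under these assumptions cycles 0 to 5 carry counter value 0, cycles 6 to 11 carry value 1, and cycle 12 crosses the line a second time.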





Figure 7: A visualization of the graph of Figure 6 executing on a 
Revolving Cylinder [Akin 93]. 


Figure 7 is an illustration of the schedule of Table 3 mapped to a Revolving
Cylinder. The transient start-up cost of the schedule seems prohibitive and would be
disproportionate were the application executed only once or twice. The benefit comes once
the machine gets past the 7th cycle. The run-time enforcement of this mapping ameliorates
the nondeterministic output rates of data-flow graphs. It is readily apparent that, although
each instance still takes 12 cycles, the system will complete an application every six cycles
and thus reach the full potential offered by two processors. In this respect the Cylinder
operates much like asynchronous pipelining on a control-flow machine [Akin 93].


Each slot in the cylinder has a width equal to the execution time of the shortest node in
the graph. For each node in the graph, starting with the top node (in our example, A) and
working towards the bottom node (F), attempt to schedule the node at its earliest start time.
If it cannot be inserted at that start time, delay the start time by the width of a slot and
repeat until it can be inserted. Adjust the earliest start time of all descendants of that node
and repeat the sequence with the next node as the top node in the graph. In the same way that
delays helped in the previous section this mapping ensures that maximum cylinder usage (and
hence throughput) will result.
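
The mapping procedure just described can be sketched as a greedy fill of a slot table. This is a minimal sketch under several assumptions and is not the implementation of [Shukla 92]: the graph edges and execution times are hypothetical (chosen only so that six nodes total 12 cycles), the slot width is one cycle, a node takes the first processor with free slots, and `rc_map` is an invented name.

```python
def rc_map(order, duration, preds, num_procs, circumference):
    """Greedy cylinder fill; returns {node: (start_time, processor)}."""
    # busy[p][s] is True once slot s on processor p has been taken.
    busy = [[False] * circumference for _ in range(num_procs)]
    placed = {}
    for node in order:  # order must list ancestors before descendants
        # Earliest start: when the last predecessor finishes.
        earliest = max((placed[p][0] + duration[p] for p in preds[node]), default=0)
        for delay in range(circumference):  # slots repeat modulo C
            start = earliest + delay
            slots = [(start + k) % circumference for k in range(duration[node])]
            proc = next((p for p in range(num_procs)
                         if not any(busy[p][s] for s in slots)), None)
            if proc is not None:
                for s in slots:
                    busy[proc][s] = True
                placed[node] = (start, proc)
                break
        else:
            raise ValueError(f"node {node} does not fit on the cylinder")
    return placed


# Hypothetical graph: edges and times are assumptions for illustration.
duration = {"A": 1, "B": 2, "C": 3, "D": 2, "E": 3, "F": 1}
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B"], "E": ["C"], "F": ["D", "E"]}

placement = rc_map("ABCDEF", duration, preds, num_procs=2, circumference=6)
```

For this hypothetical graph node F is ready at cycle 7 but is delayed slot by slot until cycle 11, the first cycle whose slot is still free, at which point all twelve slots of the cylinder are occupied.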


2. Assigning Scheduling Arcs in The Graph 
Once all nodes have been inserted into the cylinder and the cylinder is full, assign
arcs to the nodes based upon their location in the cylinder. For each entry mapped to an AP 
in the cylinder, if there are other nodes assigned to the same AP with the same index and 
the node higher up in the cylinder is not an ancestor of the other, then create a dependency 
from the higher node to the lower. The restructuring of the graph in the example is not 
unique. There are several ways of filling the table and so there are corresponding sets of
additional dependency arcs.
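
The arc-assignment rule can be sketched as a pairwise check over nodes that share a processor and a counter index. Everything concrete below is an assumption: the placement is a hypothetical filling of a two-processor cylinder of circumference 6, the graph edges are invented, "higher up in the cylinder" is read as "earlier start time", and the function names are illustrative only. The sketch emits every qualifying pair and does not apply any arc-minimization heuristic.

```python
def ancestors(node, preds):
    """All nodes from which `node` is reachable along data arcs."""
    seen, stack = set(), list(preds[node])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(preds[p])
    return seen


def scheduling_arcs(placement, preds, circumference):
    """Return extra (source, destination) arcs implied by a cylinder filling."""
    arcs = []
    nodes = sorted(placement, key=lambda n: placement[n][0])  # earlier first
    for i, upper in enumerate(nodes):
        for lower in nodes[i + 1:]:
            (s_u, p_u), (s_l, p_l) = placement[upper], placement[lower]
            same_ap = p_u == p_l
            same_index = s_u // circumference == s_l // circumference
            if same_ap and same_index and upper not in ancestors(lower, preds):
                arcs.append((upper, lower))
    return arcs


# Hypothetical graph and placement {node: (start_time, processor)}.
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B"], "E": ["C"], "F": ["D", "E"]}
placement = {"A": (0, 0), "B": (1, 0), "C": (3, 0),
             "D": (3, 1), "E": (6, 1), "F": (11, 1)}

extra_arcs = scheduling_arcs(placement, preds, circumference=6)
```

In this hypothetical filling B and C share a processor and an index but have no data dependency between them, so a single arc from B to C is produced; pairs already ordered by an ancestor relationship need no extra arc.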


The data-flow graph of Figure 5 with dependencies added.

Figure 8: Graph of Figure 5 with added scheduling arcs [Little 91].


Even for a single assignment, there exist several sets of additional dependencies. 
This introduces the problem of selecting the best assignment and a suitable set of arcs 
associated with it for some arbitrary graph. The heuristic used for such selection is 
minimization of the number of additional arcs introduced. Figure 8 shows one possible 
restructuring resulting from this technique. 

The run-time mechanism of the scheduler is fixed, and thus any execution
sequence enforcement is accomplished at compile-time. The grey lines in Figure 8 show
the additional data dependencies used to enforce the RC assignment at run-time. Each grey line
represents a queue of tokens generated by the source and absorbed by the destination. Each
source generates a single token when it completes execution. The 2-tuple associated with
each arc indicates the threshold and consume amounts for the control token flow on that arc.
The threshold amount refers to the number of tokens that must be present on the arc for its
destination node to be eligible for execution. The consume amount refers to the number of
tokens removed from the arc when the destination node executes once. Thus, the arc from B
to C forces node C to delay going onto the Node Ready List until node B is complete. This has
the same effect as specifying delays without actually scheduling them into the application graph.
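The threshold and consume amounts can be illustrated with a small sketch. The class and function names below are hypothetical and are not taken from the AN/UYS-2 software; the sketch only models the token counting described above, with each source adding one token when it completes.

```python
from dataclasses import dataclass


@dataclass
class ControlArc:
    """A control-token queue labelled with a (threshold, consume) 2-tuple."""
    threshold: int
    consume: int
    tokens: int = 0

    def produce(self, count=1):
        # The source node adds a token when it completes execution.
        self.tokens += count

    def ready(self):
        # The destination is eligible only when the threshold is met.
        return self.tokens >= self.threshold

    def fire(self):
        # Executing the destination once removes `consume` tokens.
        if not self.ready():
            raise RuntimeError("threshold not met on this arc")
        self.tokens -= self.consume


def node_eligible(input_arcs):
    """A node may join the ready list only if every input arc is ready."""
    return all(arc.ready() for arc in input_arcs)
```

With an arc from B to C labelled (1, 1), node C is not eligible until B has completed once, which is the ordering effect the added scheduling arcs are meant to produce.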
Given such restructuring, the setup and breakdown times for arcs (A,B), (B,D),
(A,C) and (E,F) can be minimized. This is done by chaining sequential nodes which feed
directly into each other. The nodes are collapsed into a single node for assignment to a
single processor. The trade-off becomes one of reduced overheads for communication
versus loss of parallelism and throughput gains. The flexibility of the system’s granularity
enables the system to make this choice effectively. It is assumed that the overhead of
implementing the control-token queues is negligible with respect to the cost of
implementing data queues [Levine 92].


D. Framework for Comparison 

Based on whether a scheduling decision is made at compile-time or at run-time we can 
classify a data-flow implementation over a spectrum that ranges from fully static to fully 
dynamic. Dynamic implementations have the most management and communication 
overhead but this makes them more flexible and easier to implement than a static 
implementation. They have the added benefit of being more robust in the case where a 
processor malfunctions and so degrade gracefully. 

To their credit, static implementations are more efficient and have the predictable 
performance crucial to a real-time system. They are, however, difficult to realize, 
inflexible, and degrade poorly. Their effectiveness is determined by how accurately the 
computational problem is known before-hand. This is a difficult problem and typically the 
worst-case estimate results in large inefficiencies. 

A carefully implemented hybrid of compile-time effort and run-time complexity
strikes the appropriate balance between throughput and guaranteed performance. RC
analysis provides such a blend by building scheduling management into the graph at
compile-time and then allowing the run-time scheduler to assign nodes to processors
dynamically.

A node is synchronous if we know, a priori, how many new input samples are 
consumed and how many output samples are produced every time a node is invoked. A 
Synchronous Data-Flow graph is a directed acyclic graph made up of synchronous nodes
[Lee 87]. Revolving Cylinder analysis is most suited for use with synchronous data-flow 
graphs. 

The RC technique is directed towards improving throughput and the determinism of 
output flow in real-time systems, under high loads, with repetitive tasks to perform. Tasks 
that fall in this category are those such as radar, Magnetic Resonance Imaging, and other 
continuous scan applications. Other real-time scheduling systems are concerned with 
getting the fastest possible response without regard to how efficient the continued 
execution of the task might be. These fall under the guise of some weapon systems 
applications in which instant response is required from a single instance of an application. 
These schedulers seek to pack an application graph so that it will run in the least possible 
number of cycles. 

The system we use is non-preemptive. Enough research is available in the literature to
obviate an extended discussion in this thesis. Suffice it to say that the graph’s inherent
structure implies nodal orders of execution. This, combined with known node execution
times, leads to more deterministic output flow than a preemptive scheduling scheme.

Figure 9 illustrates the difference between the two [Lewis p.249]:
(a) A non-preemptive and (b) a preemptive schedule for 3 tasks with an
execution time of 2 cycles.





Figure 9: Preemptive vs. non-preemptive scheduling. 


Revolving cylinder analysis is a policy which can be implemented on a number of
different machines. The key is that it improves the determinism of output flows whenever
there are repetitive tasks whose executions are deterministic. It does this by a mix of static
scheduling and dynamic assignment of nodes to processors at run time. We are interested
in the approaches used by other researchers in the field of real-time scheduling. Chapter IV
covers these in detail.
IV. ALTERNATE APPROACHES


We now look at data-flow graph scheduling techniques ranging from the scheduling
approach used to implement real-time prototypes on the Naval Postgraduate School’s
Computer Aided Prototyping System (CAPS) to Som’s multiprocessor “Algorithm To
Architecture Mapping Model”. Each of these seeks to improve the real-time performance of
systems using directed acyclic graphs. The target architectures vary from simple control
flow von Neumann machines to an SSIMD architecture. This chapter covers the approaches
in depth and discusses the strengths and weaknesses of each.


A. Scheduling Hard Real-Time Systems on CAPS 


1. An Introduction to CAPS 

The Computer Aided Prototyping System (CAPS) being developed at the Naval 
Postgraduate School seeks to overcome the complexity in the design and development of 
hard real-time environments using rapid prototyping to build and maintain these systems 
[Levine 91]. Rapid prototyping is a means for stabilizing and validating the requirements 
for complex systems (e.g. embedded control systems with hard real-time constraints) by 
helping the customer visualize system behavior prior to detailed implementation. CAPS 
supports an iterative prototyping process characterized by exploratory design and extensive 
prototype evolution, thus enabling engineers to produce complex systems that match user
needs and reduce the need for expensive modifications after delivery [Levine 92].


2. System Overview
CAPS consists of several modules. Figure 10 describes the major software
modules of the system. The first module of the system is the user interface, which consists
of a graphical editor for the formal prototyping language called Prototyping System
Description Language (PSDL). The second module is the Software Database System, which
includes the Rewrite Subsystems, the Software Design Management Subsystem, and the
Reusable Software Component Database.




Figure 10: CAPS modules [Levine 91]. 
Figure 11: The Execution Support System (ESS) [Levine 91].

The third module is the Execution Support System (ESS). This module contains
the PSDL Translator, the Static Scheduler, and the Dynamic Scheduler. Figure 11 shows
the implementation and interfaces of the ESS. The Dynamic Scheduler acts as a run-time
executive when exercising the system. It schedules nodes without timing constraints, which
are not included in the static schedule, by using spare capacity or slack in the static schedule
(see Figure 14). It handles run-time exceptions and hardware/operator interrupts and
communicates with the user interface during prototype runs. Thus it performs like a
miniature operating system.

It is the static scheduler that we are interested in. The purpose of the static 
scheduler is to build a static schedule for a set of tasks that must obey both precedence and 
timing constraints. This schedule gives the order of execution and the timing of the
operators. It is legal and feasible if precedence relationships are maintained and timing
constraints are guaranteed to be met.


3. The Static Scheduler 
The static scheduler has five modules: PSDL_READER, FILE_PROCESSOR,
TOPOLOGICAL_SORTER, HARMONIC_BLOCK_BUILDER, and
OPERATOR_SCHEDULER.

The first component, PSDL_READER, reads and processes the PSDL
prototyping program. It is essentially a filter that removes information not needed by the 
static scheduler. 

The second, FILE_PROCESSOR, analyzes the text file generated by the reader and
separates the information into a linked list data structure and a file of non-critical nodes. It
then converts sporadic operators into their periodic equivalents. The block builder and the
operator scheduler generate linked lists containing the vertices and links of the graph.

The third component, TOPOLOGICAL_SORTER, performs a topological sort
on the data structure. It develops a true topological ordering and is not dependent on a
specific ordering of nodes in the PSDL input file. The result is a total ordering of the nodes
depending on data flow.
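A topological sort of this kind can be sketched with Kahn's algorithm. This is a generic illustration, not the CAPS implementation; the function name and the edge-list input format are assumptions made for the example.

```python
from collections import deque


def topological_order(nodes, edges):
    """Return one total ordering of `nodes` that respects every (src, dst) edge.

    Raises ValueError if the graph contains a cycle.
    """
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Start from the nodes that have no unscheduled parents.
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for succ in successors[node]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    if len(order) != len(nodes):
        raise ValueError("precedence graph contains a cycle")
    return order
```

The ordering depends only on the edges, so listing the nodes in a different sequence in the input still yields an ordering in which every producer precedes its consumers.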

The fourth component, HARMONIC_BLOCK_BUILDER, determines the
Harmonic Block length of the static schedule. An illustration of the Harmonic Block is
found in Figure 14. The system takes the periods of its real-time processes and finds their
least common multiple. This guarantees that the system will schedule and execute each critical
process within the bounds of performance. The trick is to find a harmonic block which will
meet the performance constraints of a real-time system.

The last module, OPERATOR_SCHEDULER, combines the output of
TOPOLOGICAL_SORTER, FILE_PROCESSOR, and HARMONIC_BLOCK_BUILDER
to produce a static schedule. The static schedule is a linear table giving the exact execution
start time for each time-critical node and the reserved maximum execution time (MET) for
each.


4. Graph Implementation 
The nodes are atomic or composite. An atomic node is defined as the basic 
individual unit of work to be executed and a composite node is defined as being a node that 
can be decomposed into atomic nodes. This allows the system to deal with varying 
granularity. Each node is characterized by its timing constraints, precedence constraints, 
and resource constraints. The researchers assume that the resource requirements for each 
node, to include memory and external systems, are always met. 

There are two different types of data in PSDL: discrete and continuously
sampled streams. Discrete data are used in applications where the values of data must not
be lost or replicated and in which the period of the producer and consumer of the data must be
the same (lockstep performance). Sampled data are used in applications where values must
be available at all times and can be replicated without affecting their meaning. Each data
stream represents a directed edge from the node that produces the data to the node that
consumes the data in the precedence graph. Cycles, and hence internode recursions, are not
permitted in the precedence graph.

5. Creating the Schedule 


a. Algorithm Options 

After creating a constraint graph the static scheduler creates a schedule
using one of the following algorithms: Earliest Start Time, Exhaustive Enumeration, or
Simulated Annealing. Since the static scheduling problem is NP-hard, systematic global
search is the only guaranteed way to return a feasible static schedule for a hard real-time
system if such a schedule exists. The exhaustive enumeration algorithm is implemented in
CAPS to accomplish this, but the algorithm is very costly in practice.

Shing and Levine [Levine 92] developed a simulated annealing approach as a heuristic
algorithm to schedule the prototypes of hard real-time systems. The goal of this algorithm
is to quickly find a valid schedule, if one exists, in the majority of cases where the cost of
complete enumeration is too great.


b. Simulated Annealing 

The simulated annealing procedure was chosen because it is iterative,
probabilistic, simple and insensitive to the form of the cost function. An example
combinatorial optimization problem is an assignment problem where there are a number of
personnel available to do an equal number of jobs. The cost for each person to do each job
is known. The goal is to assign each person to a job so that the total cost is as small as
possible. There is a wide range of combinatorial optimization problems in a similar vein
for which simulated annealing is tractable. These include graph partitioning, graph
coloring, number partitioning, VLSI design, and travelling salesman type problems [Levine
92].
6. Basis of The Algorithm


Simulated annealing is based on the behavior of physical systems. The approach
is modelled on the way that liquids freeze and metals crystallize. At high temperature,
molecules move freely with respect to one another. As the liquid cools, this mobility is lost.
Atoms line up and form a pure crystal that is at a minimum energy level. As the system
cools it tends toward a state of minimum potential energy.


7. Annealing and Optimization 


Examining simulated annealing in non-physical terms, a comparison is made to
the concept of local optimization. Local optimization repeatedly
improves an initial solution until no further improvement of the solution is possible. This
is known as iterative improvement or “hill climbing.” Simulated annealing differs from
local optimization in that random uphill movements (acceptance of a worse solution) are
permitted while the system “temperature” is warm enough to allow it.
Figure annotations: an uphill move allows the solution to get over a potential
barrier; temperature decreases over time.





Figure 12: A representation of a simulated annealing solution’s 
cost over increasing time [Levine 91]. 
This prevents the algorithm from being trapped in a poor locally optimal solution,
as demonstrated in Figure 12. Simulated annealing can provide significantly better results
than those found using local optimization alone.

The key to the use of the simulated annealing approach to solving combinatorial
optimization problems is the random acceptance of worse iterative solutions. Initially when
the system is in a high energy state (high temperature), the probability is greater that a
worse iterative solution is accepted. As the system cools this probability decreases, but
even at the lower energy states the probability for making an uphill move still exists. Uphill
moves allow the algorithm to leave a poor local solution and reach a better solution. This
general scheme of always taking a downhill step while occasionally taking an uphill step is
known as the Metropolis algorithm [Levine 91].


8. The Cost Function 
The choice of a probability function to determine if an uphill movement is
allowed is an important consideration. At each step of the simulated annealing algorithm a
new state is constructed based on the current state. This new state is constructed by
displacing or adjusting a randomly selected element. If this new state has a lower cost than
the current state, the new state is accepted as the current state. If the new state has a higher
cost than the current state, the new state is accepted with the probability:

exp(-Δe / kT)

This function is known as the Boltzmann probability distribution, where:

Δe = difference in cost between the new state and the current state
k = Boltzmann's constant, relating temperature to energy
T = current temperature

A characteristic of this probability function is that at very high
temperatures every new state has an almost even chance of being accepted as the current
state. At low temperatures the states with a lower cost have a higher probability of being
accepted as the current state. Simulated annealing is simple to implement and can be
applied to a variety of combinatorial optimization problems.
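The acceptance rule can be sketched in a few lines. The function names are hypothetical, the temperature is assumed to be positive, and k defaults to 1 on the assumption that a scheduling cost function absorbs the physical constant into its temperature scale; none of this is taken from the CAPS code.

```python
import math
import random


def acceptance_probability(delta_cost, temperature, k=1.0):
    """Probability of accepting a move that changes the cost by delta_cost."""
    if delta_cost <= 0:
        return 1.0  # downhill (or equal-cost) moves are always accepted
    return math.exp(-delta_cost / (k * temperature))


def accept(delta_cost, temperature, rng=random.random, k=1.0):
    """Decide on a move by drawing a uniform random number in [0, 1)."""
    return rng() < acceptance_probability(delta_cost, temperature, k)
```

For the same uphill step, the probability approaches 1 as the temperature grows and approaches 0 as it falls, which matches the behavior described above.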


9. Real-Time Scheduling Constraints 
Developing hard real-time schedules using simulated annealing requires
several modifications to the steps of the simulated annealing algorithm.
These changes are required because true random orderings of graph nodes cannot be
maintained since there are precedence constraints in a hard real-time schedule. Another
change to the algorithm is that hard real-time scheduling only seeks a feasible schedule, not
the best possible or optimum schedule. This factor simplifies and speeds up the
development of the annealed schedule.


Old Ordering    New Ordering
op_1            op_1
op_2            op_3
op_3            op_2
op_4            op_4





Figure 13: Reordering of nodes using CAPS scheduler [Levine 91]. 


The method of adjusting a given solution maintains the precedence relationships
that exist between operators of a hard real-time system’s application graph. As long as
precedence is maintained nodes can be adjusted randomly within a given schedule. True
random orderings cannot occur since a parent must always appear before its children.
Figure 13 demonstrates a feasible reordering of nodes that can occur using simulated
annealing.

In both the old and the new ordering, the position of each operator in the list is 
valid based on the precedence relationship indicated by the graph. The algorithm 
guarantees precedence by checking for a parent-child relationship between nodes it is 
attempting to reschedule. The goal of the hard real-time scheduler is to find a feasible 
schedule for the graph, not the optimum schedule. This means that the search for a schedule 
is terminated as soon as a feasible schedule is found. Both loops of the annealing algorithm
are modified so that if a feasible schedule is found, the loop condition for both loops is
satisfied and annealing is terminated.
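A precedence-preserving adjustment of the kind shown in Figure 13 can be sketched as an adjacent swap that is refused when the two operators are parent and child. The function name, the parents dictionary, and the edges assumed for op_1 through op_4 are hypothetical; the source does not give the CAPS move generator in this detail. In a valid ordering two adjacent operators can only be related directly, so checking the direct parent list is sufficient for this move.

```python
def swap_if_allowed(order, i, parents):
    """Swap order[i] and order[i + 1] unless that would break precedence.

    order   -- current valid ordering of operator names
    parents -- dict mapping an operator to the list of its direct parents
    Returns a new list; the input ordering is left unchanged.
    """
    first, second = order[i], order[i + 1]
    if first in parents.get(second, ()):
        return list(order)  # parent must stay ahead of its child
    new_order = list(order)
    new_order[i], new_order[i + 1] = second, first
    return new_order


def respects_precedence(order, parents):
    """Check that every parent appears before each of its children."""
    position = {op: idx for idx, op in enumerate(order)}
    return all(position[p] < position[child]
               for child, ps in parents.items() for p in ps)
```

Assuming op_1 feeds op_2 and op_3, which both feed op_4, swapping positions 1 and 2 reproduces the new ordering of Figure 13, while a swap at position 0 or 2 is refused.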


10. Solution Deadlines 
Each proposed solution, including the initial solution, is examined to see if it
satisfies two criteria:

(1) Each node's start time is examined to see if the node starts before its earliest
allowable start time.

(2) Each node's finish time is then examined to see if it exceeds the upper bound
for node termination. If the upper bound for a node is violated, the amount of time by
which this bound is violated is added to the schedule’s cost.
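The two checks can be sketched for a serial processor as follows. The sketch makes an assumption the source does not state: a node that would otherwise start too early is simply delayed to its earliest allowable start time, so only deadline overruns contribute to the cost. All names are hypothetical.

```python
def schedule_cost(order, met, earliest_start, deadline):
    """Total time by which nodes overrun their deadlines on one processor.

    order          -- node names in execution order
    met            -- dict: node -> maximum execution time
    earliest_start -- dict: node -> earliest allowable start time
    deadline       -- dict: node -> upper bound on finish time
    A cost of zero means every timing bound is met.
    """
    clock = 0
    cost = 0
    for node in order:
        # Assumption: wait rather than start before the earliest start time.
        start = max(clock, earliest_start[node])
        finish = start + met[node]
        cost += max(0, finish - deadline[node])
        clock = finish
    return cost
```

Because the search only needs a feasible schedule, an annealing loop built on this cost can stop at the first ordering whose cost is zero.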


a. Precedence 


There is no requirement to examine a schedule to see that precedence is
maintained since each adjustment to the schedule will guarantee that operator precedence
is maintained.
b. Harmonic Block Length


The proposed schedule must also be examined to check that the finish time
of the last operator in the schedule does not exceed the harmonic block length. A harmonic
block is defined as a set of periodic operators, where the periods of all component operators
are exact multiples of the base period. The base period is the greatest common divisor of
all periods of the critical periodic operators and the harmonic block length is the least
common multiple of these periods, as in Figure 14. The basic idea is that a schedule is
developed to fit inside a harmonic block. Once a schedule is developed that fits within the
harmonic block, subsequent copies of the block can be made to maintain the hard real-time
schedule [Levine 92].
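The two quantities defined above follow directly from the operator periods. The helper names below are hypothetical, and periods are assumed to be positive integers in a common time unit.

```python
from functools import reduce
from math import gcd


def base_period(periods):
    """Greatest common divisor of all critical operator periods."""
    return reduce(gcd, periods)


def harmonic_block_length(periods):
    """Least common multiple of all critical operator periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def fits_in_block(last_finish_time, periods):
    """True if the last operator finishes within one harmonic block."""
    return last_finish_time <= harmonic_block_length(periods)
```

For the periods 5 and 2 used in Figure 14, the base period is 1 and the harmonic block length is 10 time units.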


P1 must occur every 5 time units.
P2 must occur every 2 time units.
Harmonic block length = (2 x 5) = 10 time units.
The harmonic block seeks to ensure that execution is guaranteed within these constraints.





Figure 14: Harmonic Block length in CAPS 


If a schedule does exceed the harmonic block length, it cannot be valid
because subsequent copies of the schedule will also violate their timing constraints. If the
schedule satisfies all timing constraints and the harmonic block length is not violated then
it is feasible. At this point the simulated annealing algorithm is terminated and the schedule
is returned to CAPS.

The intent of scheduling real-time systems on CAPS is to guarantee the
execution of tasks on a serial processor within a specified time bound. Thus the harmonic
block ensures time for each critical process. If a feasible schedule is found (i.e., a harmonic
block which satisfies time and execution constraints), the system guarantees that
a real-time application will execute within its bounds.


Figure annotations: what was slack is now used by FCFS non-critical processes; the
system kernel has priority over all processes, disrupting determinism of execution
and output rates.
Figure 15: Kernel interrupts and non-critical processes


Real-time processes are given a higher priority than non-critical tasks and
so execute within the bounds of the harmonic block. The system handles data arrival both
periodically and aperiodically by the use of interrupts and polling. Aperiodic data arrival
means that interrupts are necessitated by the arrival of critical tasks with higher priority
than a non-critical task currently executing on the CPU. Polling is used to handle the
execution of a queue of non-critical tasks waiting for slack in the execution of the harmonic
block. In periodic operation the system only has to handle the task of polling each of the
non-critical processes competing for system resources. Real-time processes are guaranteed
processor time by the Harmonic Block and need no polling.

One of the problems of the system is that the kernel has priority over all
tasks, as shown in Figure 15. In this example the Harmonic Block of Figure 14 is interrupted
by a system call. Once the system call is finished, the scheduler crams processes into the
block to try to meet execution time limits. Even if a critical task is in the middle of
execution, it is preempted by any kernel call. A statistical analysis can determine the
frequency of these interrupts, but there is still non-determinism in the schedule’s output
flow. Another potential problem lies in the inherently non-deterministic output flow of
Ada. There is no way to guarantee performance of the system when no time bounds are
guaranteed on the connection interface of Ada sockets, etc. This is a temporary problem
being addressed in the next versions of the language, but it does bear inspection. More
information on the approach is available in [Levine 92].


B. Scheduling for Real-Time DSP Performance on a Rectangular Grid 

Lincoln Laboratory of M.I.T. designed and implemented a Block Diagram Compiler (BDC)
for converting graphic block diagram descriptions of signal processing
tasks into source code to be executed on a Multiple Instruction - Multiple Data Stream
(MIMD) array computer [Ziss 87]. The compiler takes as input a block diagram of a real-time
DSP application entered from a graphics workstation. It then translates the graph
representation into code for the target multiprocessor array. The current implementation
produces code for a rectangular grid of Texas Instruments TMS32010 signal processors
built at Lincoln Laboratory, but the concept can be extended to other processors or
geometries.


1. Target Hardware Implementation


The current hardware implementation of the MIMD array consists of a
two-dimensional rectangular grid of TMS32010-based processing cells. The size and shape of
the array are somewhat arbitrary, with the restriction that one cell can be nearest-neighbor to
no more than four other cells. Enough communications paths exist in this array to allow it
to function as both a 4x4 square grid and a 16x1 linear array (Figure 16).


16-processor array. The current configuration has enough communications paths
to allow it to act as either a 1x16 or a 4x4 array.





Figure 16: Lincoln’s variable geometry MIMD machine [Ziss 87]. 


2. Mapping The Graph Nodes to Processors


a. Individual Processors 
A user begins by drawing a block diagram of the application using a library
of basic DSP functions implemented as nodes. The nodes can be as simple as adders and
multipliers or as complex as FFTs. Processor assignment is done either manually or by the
task-assignment module. In other words, the application nodes are scheduled statically. The
problem of mapping nodes to processors is similar to that encountered by data-flow
architectures.

The Lincoln architecture relies on special hardware to track the availability
of data. This approach uses the Lincoln machine’s hardware FIFO queues and the
efficiency gains offered by processor locality. Figure 17 illustrates the design of a single
TMS32010 processor. Data-flow concepts could be simulated in the object code, but this
imposes a heavy communications overhead contrary to the real-time processing
requirements of the system.


Figure content: a TMS32010 cell with a 16-bit data bus, a 12-bit address bus, and
connections to its north, south, east, and west neighbors. FIFO queues on two ports
allow 2 transceiver ports (1, 0) and 2 receiver ports (2, 3). This allows
asynchronous interprocessor communications.





Figure 17: Texas Instruments TMS32010 DSP [Ziss 87].


b. Entering/scheduling an application graph 
Block Diagram Compilers are normally used as parts of simulation
languages for digital signal processing. The Lincoln approach differs in two ways. First, it
uses a graphic input interface to enter the application to the machine. The second difference
is that instead of providing simulation code for a general purpose computer the compiler
directly produces efficient object code to run in real-time on a MIMD array. When the
system schedules nodes statically it takes the physical arrangement of nodes and their
processors into account. The compiler takes a graphical representation of a real-time DSP
application and translates it into efficient assembly language code for each processor.

MIMD systems are often difficult to program as the programmer must 

(1) partition the problem among the processors, 

(2) route the interprocessor data transfers, and 

(3) write different code for each processor in the array. 

The system is designed to perform these three steps automatically. Signal
processing problems usually have enough inherent structure to allow efficient mapping
onto a MIMD array. The structure typically takes the form of parallelism and pipelining and
is well represented by a directed graph. As a result the system can use an application’s
graph representation as high level compiler input.


c. Node assignment 

Nodes assigned to the same processor are linked by common memory 
locations within the processor. I/O routines are created to transfer data between nodes in 
different processors. If the terminals of an interprocessor data transfer are assigned to 
adjacent processors, the routing is trivial. If the two processors are not adjacent, store-and- 
forward routines are generated for the intermediate processors, yielding a simple packet 
network. 
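The choice of intermediate processors for a store-and-forward transfer can be sketched as below. The source does not say how the BDC picks a path, so the dimension-ordered rule used here (move along the row first, then along the column) is an assumption, as are the function name and the (row, column) coordinates. The sketch also assumes every grid point on the path holds a working processor.

```python
def intermediate_processors(src, dst):
    """List the processors between src and dst on a rectangular grid.

    src, dst -- (row, column) grid coordinates
    Returns the cells that would need store-and-forward routines,
    excluding the source and destination themselves.
    """
    row, col = src
    path = []
    # Assumed routing rule: correct the column first, then the row.
    while col != dst[1]:
        col += 1 if dst[1] > col else -1
        path.append((row, col))
    while row != dst[0]:
        row += 1 if dst[0] > row else -1
        path.append((row, col))
    return path[:-1]  # the last cell visited is the destination
```

Adjacent processors give an empty list, which corresponds to the trivial routing case; the length of the list is the number of intervening processors on the route.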

The development of the compiler was eased by the choice of an
asynchronous MIMD array hardware target. Because intercell data transfers are designed
to be asynchronous, the need for BDC software ensuring lock-step synchronous transfers
between cells was obviated. Thus, the TMS32010 assembly code controlling I/O transfers
became simple to implement because hardware handles most of the data availability
overhead.
3. Problem Partitioning and Task Assignment
Given a specification of the signal processing operations by a block diagram,
the components of this specification, the nodes, must be assigned to individual processors
in the array. At its simplest level, the structure of the array makes it possible for any block
to be assigned to any processor and have the appropriate signal paths routed between
processors.


a. Assessing Assignment quality 

While this simplistic assignment strategy might suffice for uncomplicated
situations, it is inadequate under high system utilization. Figure 19 illustrates the
random assignment of the simple graph in Figure 18. In this case we see the high
communications overhead if assignments are not chosen with respect to locality. Operator
1 is assigned to a random processor, as are the others. Communications from OP1 (heavy
black arrows) traverse a circuitous path to get to OP2 and OP3. Results from OP2
(horizontal stripes) and OP3 (hashed arrows) then wend their way to OP4. Obviously,
criteria need to be established and enforced to assess and then ensure the quality of each
assignment.







Figure 18: An example data-flow graph 
For example, the lack of a global memory demands that the memory
capacity of each processor may not be exceeded. An intuitively appealing criterion, as
opposed to a constraint, is to minimize the number of processors used in the assignment.
This global criterion is used to reduce the complexity and emphasize the conciseness of an
assignment. These requirements must be taken into account both to make a reasonable
assignment of nodes to processors and to assess the quality of the assignment.

Figure 19: Arbitrary assignment of graph


b. Optimizing the Assignment 


To achieve an assignment of signal processing components to
computational processors that satisfies a set of both local and global criteria, an
optimization problem was defined with a cost function which reflects these requirements.
The independent variables over which the optimization is performed are the processor
assignments for the nodes and signal routes through the array. These variables are
fundamentally discrete; thus, optimization procedures that require the evaluation of a
derivative cannot be used. Instead, a combinatorial optimization procedure is necessary.


4. Algorithm Description 
Simulated annealing was chosen because it answered the need for an optimized
solution over discrete variables. It can be specified by identifying a set of solutions together
with a cost function that assigns a value to each solution. There exists an optimum solution
which has the minimum cost possible. There may, of course, be more than one optimum
solution. The algorithm is the same as described in the previous section on CAPS with the
exception that the grid architecture has different costs to optimize. The main local and
global costs are summarized below:


a. Chosen Local Functions: 


(1) Memory--The memory (Mreq) required for computations is
evaluated for each processor. If this amount is less than 90% of the total available (Mavail),
the cost function is zero. If greater than this number, the cost function equals:


ae 
Mreq (0.9 x Mavail) 


As the TMS32010 has separate program and data memories, the memory cost function
is evaluated for each and summed.


(2) Real-Time--A cost function similar to that used for memory
usage is used to assess computational requirements. The number of cycles required by all
of the blocks assigned to a processor is summed. If this sum is less than 90% of the total
available time, the cost function is zero; if greater, the cost function assumes a quadratic form.
(3) Input/Output--In addition to the impact of signal computations on the 
memory and processing power of a single processor, an assessment is also made of the 
computational demand of the input/output programs required by the signal routing 
mechanisms. The memory and computations required to support the routing are included 
in determining the memory and real-time components of the cost function. 


(4) Special Capabilities--Several processors have special “capabilities” 
that distinguish them from the others. For example, only one processor has an A/D and 
D/A converter and another has the host-network interface. A subtle capability that is 
common to all processors is their presence. The processor array is assumed to be a 
rectangular grid, with some of the grid points having no processor. This capability allows 
the specification of processors that are no longer functioning and of irregular geometries. 
Those blocks in the original block diagram requiring these capabilities are noted. If such a 
block is assigned to a processor lacking a specific capability, this component of the cost 
function is given a large non-zero constant. 


b. Chosen Global Functions: 


(1) The length of each signal route is measured in terms of the number of 
intervening processors. If this signal is not involved in a feedback loop, two times the 
length of the signal route is added to the cost function. If part of a feedback loop, ten times 
the length is added. This component of the cost function has the effect of reducing the 
number of processors used to support signal processing. Because of their inefficiency, 
feedback loops are especially penalized so that the components of each loop are kept 
physically close. If possible they are mapped to the same processor. Figure 20 illustrates a 
possible new assignment of a simple graph with these overheads taken into consideration. 


Assignment maps nodes physically close 
in order to limit length of 
communications paths. If possible 
it will collapse nodes into a single 
processor. 





Figure 20: Optimized static assignment of nodes to processors 


(2) In the context of simulated annealing, a perturbation of the assignment 
of nodes and signal routing is made. With probability 1/4, a node is randomly assigned to 
another processor in the array and the attached signals are rerouted. With probability 3/4, 
a signal is chosen randomly and a different routing for the signal is made. The routing 
algorithm has probabilistic aspects as well. A small number of random routings between 
the two processors containing the signal routing components are made and the one having 
the smallest length is chosen as the new routing. If a signal does not require interprocessor 
routing (i.e., the nodes are assigned to the same processor), the intraprocessor routing is 
always chosen. 
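
The local and global cost components listed above can be sketched in a few lines of 
Python. This is only an illustration under stated assumptions: the quadratic form of the 
memory penalty, the value of the capability constant, and all function and field names 
are hypothetical and are not taken from [Ziss. 87]. 

```python
# Hypothetical sketch of the cost components; names and constants are assumptions.

CAPABILITY_PENALTY = 1.0e6  # "large non-zero constant"; the actual value is not given


def threshold_penalty(required, available, fraction=0.9):
    """Zero while usage is at or below fraction * available.

    Above the threshold the penalty is taken to be quadratic in the excess.
    The text states a quadratic form only for the real-time cost; using the
    same form for memory is an assumption made here.
    """
    excess = required - fraction * available
    return 0.0 if excess <= 0 else excess ** 2


def route_cost(length, in_feedback_loop):
    """Two times the route length, or ten times for signals in a feedback loop."""
    return (10 if in_feedback_loop else 2) * length


def total_cost(processors, routes, capability_violations):
    """Sum the local costs over processors and the global cost over routes.

    processors: list of dicts with required/available program memory,
                data memory, and cycles (field names are illustrative).
    routes:     list of (length, in_feedback_loop) pairs.
    """
    cost = 0.0
    for p in processors:
        # Program and data memories are evaluated separately and summed.
        cost += threshold_penalty(p["prog_mem_req"], p["prog_mem_avail"])
        cost += threshold_penalty(p["data_mem_req"], p["data_mem_avail"])
        cost += threshold_penalty(p["cycles_req"], p["cycles_avail"])
    cost += sum(route_cost(length, fb) for length, fb in routes)
    cost += CAPABILITY_PENALTY * capability_violations
    return cost
```

With this weighting, a feedback route of length 2 costs as much as ten unit-length 
feedforward routes, which is what pushes the components of a loop onto neighboring 
processors or onto the same processor. 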
With these definitions of the cost function and of what constitutes a random 
perturbation, the simulated annealing algorithm requires several thousand iterations to 
determine the optimal assignment. The “temperature” is reduced geometrically at each 
iteration (a reduction factor of about 0.9995 is used). The initial value of the temperature 
is equal to twice the maximum change of the cost function when ten random trial 
assignments are made; typically, this value is several hundred “degrees”. The terminating 
threshold value for the temperature is fixed at 0.1. Although the minimum cost assignment 
is not always found, the real-time and memory constraints are always met. Typically, 
sub-optimum results have inefficient signal routes. 
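
The cooling schedule described above can be sketched as follows. This is a minimal 
illustration, not the BDC implementation: the acceptance test is the standard Metropolis 
rule, which is assumed here, and the names `anneal`, `perturb`, and `iterations_needed` 
are hypothetical. The initial temperature is interpreted as twice the largest cost change 
seen over ten random trial perturbations. 

```python
import math
import random


def iterations_needed(t_start, t_stop=0.1, alpha=0.9995):
    """Number of geometric cooling steps needed to fall from t_start to t_stop."""
    return math.ceil(math.log(t_stop / t_start) / math.log(alpha))


def anneal(initial, cost, perturb, alpha=0.9995, t_stop=0.1, rng=None):
    """Simulated annealing with geometric cooling (assumed Metropolis acceptance).

    cost(solution) returns a number; perturb(solution, rng) returns a new
    candidate solution without modifying the one passed in.
    """
    rng = rng or random.Random(0)
    current, current_cost = initial, cost(initial)

    # Initial temperature: twice the largest cost change over ten random trials.
    max_delta = 0.0
    for _ in range(10):
        trial = perturb(current, rng)
        max_delta = max(max_delta, abs(cost(trial) - current_cost))
    temperature = 2.0 * max_delta

    best, best_cost = current, current_cost
    steps = 0
    while temperature > t_stop:
        candidate = perturb(current, rng)
        delta = cost(candidate) - current_cost
        # Always accept improvements; accept uphill moves with
        # probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature *= alpha  # geometric reduction at each iteration
        steps += 1
    return best, best_cost, steps
```

For a starting temperature of 300 “degrees”, `iterations_needed(300)` gives roughly 
sixteen thousand steps before the temperature falls below 0.1. 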

The intent of the BDC and the array is to bring a real-time environment to 
applications too large for a single processor, but without the detailed programming often 
required for parallel computation. Real-time performance is not obtained by assigning 
each node to its own processor and having a compiler determine an optimal signal routing 
but instead by having the program for each processor consist of tightly coupled, efficiently 
debugged program modules with a minimum of interprocessor computation. 

MIMD architectures are more general than other multiprocessors. Despite their 
usual synchronization overheads they can be used to advantage with data-flow and large 
grain computation [Lewis, p.210]. The approach used in the TMS machine allows some 
asynchronous operation and so eases the control overhead faced in synchronous machines. 
There are other benefits as well. The use of a grid with specifiable processor degradation 
yields an architecture that fails more gracefully than a synchronous machine in the event of 
processor failure or system error. 

The distributed memory of the architecture does impose global limits on the 
memory capacity of the machine and so limits its flexibility. Another shortcoming is that 
there is no code optimization for groups of programs chained onto a single processor. 
Nonetheless, the Lincoln machine gives us insight as to how a heuristic algorithm can be 
used to statically schedule a graph for real-time on a MIMD array. Further information can 
be found in [Ziss. 87]. 
C. Optimal Implementation of Flow Graphs on SSIMD Multiprocessors. 


The next approach we discuss was developed by Barnwell and Schwarz [Barnwell 84] 
at the Georgia Institute of Technology. It is a general technique for the implementation of 
recursive and nonrecursive signal flow graphs and other arithmetic algorithms on 
synchronous digital machines composed of many identical programmable processors. 


1. Optimality 

Barnwell [Barnwell 84] defines three different categories of optimality: An 
implementation is said to be rate optimal if it achieves its sampling, or input rate, bound. It 
is delay optimal if it does not exceed its delay, or output rate, bound. Lastly, it is processor 
optimal if it exhibits perfect processor efficiency such that every cycle of every processor 
is used directly on the fundamental operations of the algorithm and no cycles are used for 
synchronization or systems control. These definitions are not mutually exclusive and any 
implementation could satisfy the criteria. 

The Georgia Tech approach is characterized by two fundamental properties: 

First, it uses the Skewed Single Instruction Multiple Data (SSIMD) mode in 
which exactly the same program is executed on all the processors, and that program is an 
exact single processor realization of the entire algorithm being implemented. 

Second, all the data precedence relations among the processors are automatically 
maintained by the inherent synchrony of the system. This often results in processor- 
optimum solutions in which the use of M processors leads exactly to an M-fold increase in 
the system throughput. 

These techniques result in a procedure in which the algorithm is specified in 
some simple notation, such as a set of difference equations, and from this a completely 
parallel multiprocessor implementation for the algorithm is generated. 

The resulting implementation is always either processor-optimum or 
time-optimum, in which case the absolute throughput limit for the technique has been 
reached. In addition, for a large class of recursive signal flow graphs, the implementations 
are absolutely optimum in the sense that there is no better implementation for a particular 
signal flow graph and a particular constituent processor. The approaches discussed here 
have been tested on a synchronous multiprocessor system. 


2. The SSIMD Mode 

The fundamental computational mode which is utilized in these implementations 
is the Skewed Single Instruction Multiple Data mode. In this mode, exactly the same 
instruction stream is executed on all processors, but with a fixed time skew maintained 
between the instruction execution times of the separate processors. The program realizes 
exactly one time-iteration of the flow graph. Figure 21 illustrates a digital signal flow 
graph and a single processor realization of the same. 

In a single processor realization, none of the delay elements are realized directly, 
but rather the output from each delay element becomes an input to the program and the 
input to each delay element becomes an output of the program. In the SSIMD realization, 
these delayed values are not computed by this processor, but are supplied from identical 
computations on other processors. 


[Figure 21 panels: (left) signal flow graph for a second order recursive Direct form II 
digital filter with input x(n); (right) single processor realization of the digital filter with 
recursive inputs I(1) and I(2). The delays are not implemented by the program but are 
realized by the parallel structure.] 

Figure 21: Recursive digital filter flow graph [Barnwell 82]. 
Figures 22 and 23 show a single processor and a two processor SSIMD 
realization for the signal flow graph of Figure 21. In the single processor solution of Figure 
22, all of the past values of r(n) are supplied by the same processor, and there is never an 
issue of data availability. In the two processor realization of Figure 23, alternate points are 
supplied by each processor, and the two processors must be skewed such that the data 
requirements of each are always met by the other. 


In a single processor SSIMD 
realization, all recursive outputs 
are supplied by the same processor 





Figure 22: SSIMD single processor realization of a recursive filter 
[Barnwell 82]. 


In a multiple processor realization, recursive outputs 
are supplied by another processor 

Figure 23: Multiple processor realization of the recursive filter of 
Figure 21 [Barnwell 82]. 
It should be noted that these SSIMD solutions are “free running” such that 
whenever a processor completes the computations associated with one time index, it 
immediately begins the computations associated with another time index. Hence each 
program realizes an infinite loop and, under the assumption that the program timings are 
not data dependent, each loop takes exactly the same amount of time to execute. Thus, if M 
processors are started at times t(m), 0 <= m <= M-1, then the relative time skews so 
imposed remain fixed until the programs are halted externally. 

The problem of implementing a particular iterative arithmetic program reduces 
to specifying the M starting times, t(0)...t(M-1), such that all the data required for the 
various computations are available before they are needed. 


3. Implementing Recursive Arithmetic Programs 

The problem of implementing a particular recursive signal flow graph in SSIMD 
mode can be divided into two related problems. The first is the problem of finding and 
characterizing all legal SSIMD solutions for a particular single processor program for 
implementing the signal flow graph. The second problem is that of constructing the 
particular single processor program such that the eventual SSIMD scheduling solution will 
be optimum. This section addresses the first problem for single input/single output signal 
flow graphs. These results are easily extended to multiple input/output systems. 

In fitting the programs together in SSIMD mode, the data which must be used 
include the length of the program, T; the times at which the delayed recursive inputs are 
first used, I(1)...I(L), for a system with longest delay L; and the time at which the 
recursive output is available, R. 

The first point to note is that all SSIMD solutions are bounded by the solution 
with equally spaced starting times. It can be proven that in SSIMD, the processors operate 
in a circular fashion, and the relationships between a single processor and its predecessors 
and successors in the processor chain are identical for all processors in the system. Any 
advantage attained by local time perturbations at one processor would be lost at some other 
processor. Hence, all SSIMD solutions are bounded by equally skewed solutions. 

Based on these results, four important features should be noted. 

First, given a single processor program for a signal flow graph (or other 
algorithms describable in a similar fashion), the maximum number of processors which can 
be used is immediately available and the starting times for the processors in SSIMD mode 
are simply computed. Hence, for a given program, the SSIMD implementation procedure 
is very simple. 

Secondly, and more importantly, the maximum number of processors which can 
be used to advantage is a function of a single time index, I(l), 1 <= l <= L, where L is the 
longest delay in the system. Hence, a simple constraint exists for optimizing a particular 
program for SSIMD implementation. The program is obtained by maximizing the 
minimum number of processors, M(l), which could be utilized on any arbitrary recursive 
input whose time of delay was the constraint on the system [Barnwell 82]. 

Third, and perhaps most importantly, the optimum time skew is not a function 
of either the program duration or the number of recursive inputs or outputs of the program. 
This allows for several important generalizations to be made and, for properly written 
programs, leads to impressive solutions. For example, the system of Figure 21 can typically 
be implemented with 8 or 9 processors even though it has only two recursive inputs. The 
throughput gains for a data-flow architecture working with recursion are immediate. 

Finally, it should be noted that there are no constraints at all if the algorithm is 
nonrecursive. In a theoretical sense, this is a trivial statement, since it is clear that if there 
are no constraints on data availability, then any number of processors can be used to 
advantage. However, in an implementation sense the SSIMD approach still leads to elegant 
processor-optimum solutions for any number of processors. 
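
The timing constraint behind these features can be illustrated with a short sketch. It 
is a derivation from the quantities T, R, and I(l) defined above, not a formula quoted from 
[Barnwell 82], and it assumes equal skews of T/M, integer cycle counts, and no 
communication latency. Under those assumptions the iteration for time index n starts at 
n x T/M and first uses the value delayed by l at n x T/M + I(l); that value becomes 
available at (n - l) x T/M + R, so each delay l requires R - I(l) <= l x T/M. The function 
names are hypothetical. 

```python
from fractions import Fraction


def max_processors(T, R, first_use):
    """Largest M for which equally skewed starts meet every data dependency.

    T: program length in cycles; R: cycle at which the recursive output is
    available; first_use: dict mapping delay l (1..L) to I(l), the cycle at
    which the input delayed by l is first used. Returns None when no delay
    constrains the schedule.
    """
    bounds = []
    for l, i_l in first_use.items():
        wait = R - i_l  # cycles between first use and availability
        if wait > 0:
            bounds.append((l * T) // wait)  # M(l), the bound for delay l
    return min(bounds) if bounds else None


def skew_is_legal(T, R, first_use, M):
    """Check R - I(l) <= l * T / M for every delay l, using integer arithmetic."""
    return all(l * T >= M * (R - i_l) for l, i_l in first_use.items())


def start_times(T, M):
    """Equally spaced starting times t(0)...t(M-1) with skew T/M."""
    return [Fraction(m * T, M) for m in range(M)]
```

For a hypothetical program with T = 20, R = 19, I(1) = 17, and I(2) = 15, both delays 
give M(l) = 10, so ten processors can be used; moving the first use of either delayed input 
earlier in the program would lower the bound. 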
4. Optimum Signal Flow Graph Implementation 

A study produced a set of systematic procedures for generating single 
processor programs which could produce optimum realization when utilized in the SSIMD 
mode. The problem addressed was how to proceed in an automated fashion from a simple 
representation of a signal flow graph, such as a set of difference equations or a matrix 
representation, to a single processor program which maximized the minimum value of 
M(l). This solution can be found by systematically investigating both the computational 
orderings of and at the nodes. It is easy to see that it can be accomplished inefficiently by 
an exhaustive search. 

The most important result, however, concerns the optimality of the SSIMD 
solutions. For a very large class of signal flow graphs, including both the normal and 
transposed forms of all direct form, cascade, and parallel digital filters, the SSIMD solution 
is absolute optimum in the sense that, for a particular constituent processor, it achieves the 
greatest possible throughput for the fewest possible processors. 

This can be illustrated in the context of the example of Figure 21. First note that 
in order to maximize the number of processors used, the time needed to make recursive 
feedback available must be minimized. This requires that each of the recursive delayed 
inputs, I(1) and I(2) in Figure 22b, be first used as near in time to the completion of the 
computation of the recursive output, R, as possible. This leads to the general principle that 
when ordering the computations at a node, the delayed recursive inputs should be used last. 
This shows that the system throughput is not a function of the length of the program or the 
number of delayed inputs, but is only a function of the input/output time for one result and 
the time of one multiply/add operation. These are fundamental constraints of the processors 
themselves. 

Further, the output/input of a result and the multiply/add operations are the 
minimum possible required computations in a recursive signal flow graph. Since a single 
processor realization involves the fewest possible special (non-arithmetic) operations, it 
also achieves this throughput with the fewest processors. These results can be generalized 
for a large class of recursive signal flow graphs, which leads to several important points. 

The first is that the SSIMD implementations are generally simpler than other 
multiprocessor options which typically include the parsing of the signal flow graph to 
promote local parallelism. By including everything needed for each instantiation of an 
application graph on each of the processors the overhead of interprocessor communications 
is minimized. This requires individual processors capable of handling the entire graph. 

The second important point is that all the limits on the number of processors and 
the throughput are a reflection of the recursive nature of the programs. As previously noted, 
if there is no recursion, then the solution is no longer constrained by the algorithm but rather 
by the nature of the hardware. 

The largest potential problem in SSIMD solutions concerns the 
interprocessor communication issues. Since the entire SSIMD development is done under 
the assumption that the processors can communicate “at will”, this would first appear to be 
a critical issue. It turns out, however, that it is not. This is true for two reasons. 

First, the fundamental periodicity of the SSIMD solution makes the 
communications requirements very uniform, which avoids many potential time conflicts. 
Second, and most important, the nature of the communications environment can be 
systematically controlled. To see this, one simply needs to note that the number of 
processors with which a particular processor must communicate is controlled by the 
maximum length of the delay elements in the application graph. 

The use of long delay chains does improve the final solution since it leads to 
SSIMD realizations which require fewer processors to realize a time-optimum solution. 
But the entire procedure still works if the maximum delay length is constrained to be one. 
This is the case for the classical formulation for signal flow graphs. For such realizations, 
each processor only communicates with its two nearest neighbors, and communications are 
always unidirectional. Such realizations have the same maximum throughput rate, but, in 
general, require more processors to achieve it. Most important, however, they have a 
communications environment which is always trivially implementable. 

SSIMD first fully distributes the algorithm in time because a separate time index 
is assigned to each processor. It then explicitly maps this time distribution into space. The 
difference between this and a systolic array is that a systolic implementation only maps the 
algorithm in space. The SSIMD approach is more processor and rate optimal than a systolic 
array. The primary advantage of SSIMD comes from the fact that instead of viewing the 
problem from the system clock, time is referenced at the individual processors and so a 
complex timing problem in the systolic array becomes a relatively simple one in the SSIMD 
paradigm. More information on the approach is available in [Barnwell 82] and 
[Barnwell 84]. 


D. ATAMM: A Paradigm for Predictable Performance in Real-Time on Multiprocessors 


1. The Algorithm To Architecture Mapping Model (ATAMM) 

Som, Mielke, and Stoughton of Old Dominion University are working on the 
development of strategies for predictable performance in homogeneous multicomputer 
data-flow architectures operating in real-time [Som 90]. The approach is restricted to 
large-grained, decision-free applications such as the real-time implementation of control 
and signal processing algorithms. The mapping of such algorithms onto data-flow 
architectures is realized by a new marked graph model called ATAMM (Algorithm To 
Architecture Mapping Model). Algorithm performance and resource needs are determined 
for predictable periodic execution of algorithms. Predictability in performance and resource 
requirements is achieved by algorithm modification and input data injection control. 
Performance degrades gracefully as the number of available resources decreases. Two key 
areas the approach focuses on are as follows. 

First, the ability to achieve predictability of algorithm performance is considered 
to be the most important feature of real-time computing. It is sometimes more important 
than the actual performance. The designer of such a system must have precise knowledge 
about the time of input arrival and output generation for the algorithm, not simply 
knowledge of statistics concerning average performance. However, predicting algorithm 
performance accurately in multicomputer data-flow architectures is known to be very 
difficult, as most scheduling problems in a multicomputer environment are NP-hard. 


[Figure 24 component label: Directed Graph Tool] 

Figure 24: ATAMM Architecture [Som 90]. 


Second, very little research has been directed towards resource management in 
data-driven computing. The execution of algorithms must be controlled so that resource 
need does not exceed resource availability. Otherwise data packets experience unnecessary 
waiting times and require extra storage space, and system performance becomes 
unpredictable. 

This scenario is unacceptable in real-time computing with hard deadlines for 
outputs. New abstract computational models are required for real-time data-driven 
computations which lead to algorithm performance and resource requirements that are 
predictable. Ongoing research at Old Dominion University has led to the development of 
a new marked graph model for describing the execution of algorithms in real-time 
data-flow architectures, ATAMM. 

The model specifies the criteria for a multicomputer operating system to achieve 
predictable performance within resource constraints. It represents the applications as 
directed acyclic graphs. The architecture is assumed to consist of two to twenty identical 
functional units or resources each having a capability of processing, communication, and 
memory. The number of algorithm nodes in a problem is not expected to be more than 
twenty. This range of functional units and algorithm nodes is selected due to the 
large-grained aspect of the target algorithms and knowledge of target architectures. 

The approach to achieving predictability in performance and resource 
requirements is to modify the algorithm graph and to control the input data injection rate so 
that a functional unit is always available for every enabled algorithm node. Algorithm 
performance is characterized by throughput and computing speed. When sufficient 
resources are available, the system executes algorithms with maximum throughput and 
maximum computing speed and the corresponding resource requirement is predicted. 

The programmer can develop strategies for graceful degradation in performance 
when only limited resources are available or when resources fail. The user is able to 
specify, off-line, trade-offs between decreasing throughput or decreasing computing speed 
with the help of a software design tool. The operating system is able to implement these 
changes on-line in real-time as the number of resources decreases. 


2. The ATAMM Model [Som 90] 

ATAMM describes algorithm execution on a data-flow architecture by three 
marked graphs, the algorithm marked graph (AMG), which is similar to the input graph 
used in RC scheduling, the node marked graph (NMG), and the computational marked 
graph (CMG). 
a. Algorithm Marked Graph 


An algorithm marked graph is a marked graph which represents a 
decomposed algorithm. Transitions and places represent algorithm operations and 
operands respectively. The transition times represent the computational times required for 
the algorithm operations. The algorithm marked graph contains an edge (i, j) directed from 
node i to node j if the output of node i is an input for node j. Edge (i, j) is marked with a 
token if the output from node i is available as an input to node j. All edges can have a queue 
and accommodate more than one token at a time. 

To illustrate the representation of a computational problem consider the 
directed graph in Figure 25. This input graph is used by ATAMM and is similar to that used 
in RC analysis. 


Figure 25: ATAMM input graph (AMG) [Som 90] 
b. Node Marked Graph 


NMG EDGE LABELS 

IF   Input buffer full 
IE   Input buffer empty 
DR   Data read 
PC   Process complete 
PR   Process ready 
OE   Output buffer empty 
OF   Output buffer full 





Figure 26: A sample Node Marked Graph [Som 90]. 


The node marked graph (NMG) is a representation of the execution of a 
transition on the AMG by a functional unit. Three primary activities, reading of input data 
from memory (r), processing of input data to compute output data (p), and writing of 
output data to memory (w), are represented as transitions in the NMG. Data and control 
flow paths are represented as places, and the presence of signals is notated by tokens 
marking appropriate places. The NMG for an AMG transition is shown in Figure 26. 

The conditions for firing the process and write transitions of the NMG are 
as defined for a general Petri net, while the read transition has one additional condition for 
firing. A functional unit must be available for assignment to the algorithm operation before 
the read node can fire. Once assigned, the functional unit is used to implement the read, 
process, and write operations before being returned to a queue of available functional units. 
The initial marking for an NMG consists of a single token in the process ready place so that 
only one functional unit can work on an AMG transition at a time (static data-flow 
architecture). However, the Output Buffer Empty (OE) edge may contain a number of 
tokens so that the execution of an AMG transition can be repeated by another functional 
unit before the output is consumed. The total initial number of tokens on the OE and OF 
edges is the size of the output queue in edge OF. 
c. Computational Marked Graph 


The computational marked graph (CMG) is constructed from the AMG by 
replacing every transition by the corresponding NMG. AMG places are replaced by place 
pairs, a forward directed place representing data-flow and a backward directed place 
representing control flow. The performance measure TBIO (time between input and 
output) is the elapsed computing time between an algorithm input and the corresponding 
algorithm output. Therefore, TBIO is an indicator of computing speed. The 
algorithm-imposed lower bound for TBIO, denoted TBIO(lb), is given by the sum of 
transition times for nodes contained in the longest directed path (critical path) from the 
input source to the output sink in the AMG. 

The performance measure TBO (time between outputs) is the elapsed 
computing time between successive algorithm outputs when the AMG is operating 
periodically at steady-state. Therefore, the inverse of TBO is an indication of output per 
unit time or throughput. The algorithm-imposed lower bound for TBO, denoted TBO(lb), 
is given by the largest time per token of all directed circuits in the CMG. A second bound 
on TBO is imposed by the availability of resources. The resource-imposed minimum value 
of TBO is given by TCE/R, where TCE (total computing effort) is the summation of all the 
transition times of the AMG and R is the number of available processors. 
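
These bounds are simple to compute once the AMG is written down. The sketch below 
is an illustration only, using a hypothetical representation of the AMG as a dictionary of 
transition times and a list of edges; it is not the design tool of [Som 90]. It takes the 
longest path over the whole acyclic graph, which matches the source-to-sink critical path 
when the graph has a single source and sink. The circuit-based bound TBO(lb) needs the 
CMG, so it is passed in as a given value rather than derived. 

```python
def critical_path_time(times, edges):
    """TBIO lower bound: largest sum of transition times along a directed path.

    times: dict mapping node name to transition time.
    edges: list of (i, j) pairs meaning the output of i is an input of j.
    The graph is assumed to be acyclic.
    """
    preds = {n: [] for n in times}
    for i, j in edges:
        preds[j].append(i)

    finish = {}

    def finish_time(n):
        # Earliest completion of n when every enabled node gets a resource.
        if n not in finish:
            ready = max((finish_time(p) for p in preds[n]), default=0)
            finish[n] = ready + times[n]
        return finish[n]

    return max(finish_time(n) for n in times)


def total_computing_effort(times):
    """TCE: the sum of all transition times of the AMG."""
    return sum(times.values())


def tbo_lower_bound(times, resources, tbo_lb_algorithm):
    """Smallest achievable TBO: the larger of TBO(lb) and TCE/R."""
    return max(tbo_lb_algorithm, total_computing_effort(times) / resources)
```

For a hypothetical four-node graph with times 1, 2, 3, and 1 and an assumed TBO(lb) 
of 3, two processors give a TBO of at least 3.5, while four processors leave the algorithm 
bound of 3 as the limit. 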


3. Injection Control 
When presented with continuously available input data packets, the natural 
behavior of a data-flow architecture results in operation where new data packets are 
accepted as rapidly as the available resources and the input node of the AMG permit. This 
leads to operation at a steady-state where TBIO > TBIO(lb). This occurs because the 
pipeline from input to output becomes congested with extra data packets which must wait 
for free resources to be processed. From bounds on TBO, the output of the AMG cannot be 
generated at a rate higher than 1/TBO(lb) or R/TCE. Injection control is a control procedure 
which limits the maximum rate at which new input data packets can be injected from the 
source. Therefore, injection control eliminates data packet congestion and thus preserves 
operation at TBIO(lb). 

Two diagrams which display graph play and are useful for determining the 
number of resources needed to achieve specified performance measures are the SGP and 
TGP. The SGP (Single Graph Play) diagram displays the execution of each node of the 
AMG as a function of time. The diagram is constructed for a single input data packet 
assuming availability of a resource for every enabled node. When several nodes are active 
at the same time, lines indicating node activity are stacked vertically so that computing 
concurrency is apparent. A sample SGP diagram is shown in Figure 27. 
Figure 27: Single Graph Play diagram for the graph of Figure 25 
[Som 90]. 

The resource requirements to execute a single data packet are obtained by 
counting the number of active nodes during each time interval in the SGP diagram. The 
peak resource requirement is denoted by Rmin and represents the minimum number of 
resources required to achieve SGP. As an example, Rmin is 4 for the graph of Figure 25. 

The TGP (total graph play) diagram is a diagram which displays the execution 
of each algorithm node when the algorithm is executed periodically in steady-state with 
period TBO. As with SGP, the diagram is constructed under the assumption that a resource 
is available for every enabled node. The TGP diagram is drawn using information from the 
SGP. SGP is divided into segments of width TBO and these segments are overlaid to form 
TGP. Each segment from SGP represents a new input data packet. Data packets are 
numbered sequentially so that the packet numbered (i+1) is the data packet which is input 
to the algorithm TBO time units after the packet numbered i. To illustrate the construction 
of this diagram, TGP for the graph of Figure 25 is shown in Figure 28. 
Figure 28: Total Graph Play of Figure 25’s graph [Som 90]. 


The resource requirements to execute multiple data packets injected with period 
TBO are obtained by counting the number of active nodes during each time interval in the 
TGP diagram. As TGP is periodic at steady state with period TBO, so also is the total 
resource requirement. The peak resource requirement necessary to execute the graph 
periodically with period greater than or equal to TBO is denoted Rmax. 

Rmax is determined by finding the largest resource requirement in all TGP
diagrams drawn for injection intervals greater than or equal to TBO. For example, the TGP
diagram drawn for TBO = TBOlb = 2 shown in Figure 27 indicates that a minimum of 7
resources is required. If this same TGP is drawn for all values of TBO ≥ 2, it can be shown
that the required number of resources does not exceed 7.
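
The search for Rmax over all injection periods can be illustrated as follows. This is a sketch under stated assumptions: TBO takes integer values, and the SGP diagram is summarized as a list of active-node counts per unit interval. The names `peak_resources` and `r_max` and the sample counts are hypothetical and are not taken from Figure 24.

```python
def peak_resources(sgp_counts, tbo):
    """Peak number of active nodes in the TGP for one injection period."""
    # Offset r of the period collects SGP intervals r, r + tbo, r + 2*tbo, ...
    return max(sum(sgp_counts[r::tbo]) for r in range(tbo))


def r_max(sgp_counts, tbo_lb):
    """Largest TGP peak over all integer injection periods TBO >= tbo_lb."""
    if tbo_lb < 1:
        raise ValueError("tbo_lb must be a positive integer")
    # Once TBO reaches the SGP length the segments no longer overlap and the
    # peak equals the SGP peak (Rmin), so the scan can stop there.
    last = max(tbo_lb, len(sgp_counts))
    return max(peak_resources(sgp_counts, tbo)
               for tbo in range(tbo_lb, last + 1))


# Hypothetical SGP activity counts per unit interval (illustrative only).
EXAMPLE_COUNTS = [1, 2, 1, 1]

if __name__ == "__main__":
    print("Rmin =", max(EXAMPLE_COUNTS))
    print("Rmax for TBOlb = 2:", r_max(EXAMPLE_COUNTS, tbo_lb=2))
```

The scan covers every period from TBOlb upward rather than TBOlb alone because the peak is not guaranteed to fall steadily as TBO grows.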
V. CONCLUSION


A Chinese proverb says, “To know the road ahead, ask those coming back.” The
approaches of the previous chapter give an insight into the relative merits and possible
shortcomings of both the AN/UYS-2 and RC analysis. Viewing the system from the perspective
of other real-time systems gives insight into how its performance might be improved further.


A. The RC Approach in Context 

Revolving Cylinder analysis is a flexible policy developed to improve the performance
of the AN/UYS-2. It is unusual because it takes control and communications
overheads into consideration when executing in real time. Its attraction lies in its ability to
reduce these overheads in the system while maintaining the fullest possible utilization of
all processors. RC analysis can be implemented on a variety of architectures and has merit
beyond the confines of the AN/UYS-2 architecture.


1. Static vs. Dynamic Node - Processor Assignment 

The AN/UYS-2 schedules its nodes statically but allows the hardware scheduler to
assign the nodes to processors dynamically at run time. This keeps structure in
the execution order of an application graph without introducing control overheads at run
time. RC scheduling gives a deterministic output flow rate with the caveat that the
application’s nodes must have a regular (i.e., non-branching) execution profile. The
trade-off is that the system cannot guarantee that determinism because of a lack of prior
knowledge about where the system will execute each particular node of an application.

CAPS uses a fully static scheduling approach to schedule and map execution nodes to 
a conventional processor. The use of the harmonic block allows the target machine to run 
a number of processes at different execution rates while still meeting real-time deadlines. 


RC scheduling in its current incarnation is rate-monotonic. The range of applications
suitable for the AN/UYS-2 can be increased by the addition of a flexible rate mechanism
along the lines of CAPS.

Static assignment such as that found in the Lincoln and Georgia Tech machines will
increase the determinism of the machine’s output flow by inducing “lock-step” execution
of each node. The AN/UYS-2 cannot implement this scheme without incurring huge
communication penalties because of the common bus each resource uses to communicate
with the scheduler and other processors/global memories.


2. Throughput During High Demand Periods 

The performance of the AN/UYS-2 is improved under high loads with the 
implementation of the Revolving Cylinder [Akin 93] because of the increased determinism 
in throughput rates. The ATAMM approach seeks to control determinism through the 
control of data injection rates. While this does help induce regularity, it loses some of the
structure of the original data. This matters in the threat environment in which the AN/UYS-2
is going to operate.

The CAPS implementation suffers throughput degradation under high loads 
because slack is removed from the harmonic block and any kernel calls made will delay the 
execution of real-time processes past their deadlines. The inability to predict this delay 
through anything but statistical analysis is concerning in a real-time environment. 

SSIMD can achieve high throughputs but the Georgia Tech machine is more 
suited to problems of finer granularity than those handled on the AN/UYS-2 because it loads
an entire application onto each processor. The execution of small application graphs is 
faster on the machine because of the inter-processor communications but the architecture 
is not as flexible as that of the AN/UYS-2. 

The MIMD hardware of the Lincoln machine allows a locality of assignment not
possible with the AN/UYS-2. The fact that the processors can communicate with each other
without having to get on a common bus makes this an attractive idea. The ability to do this
reduces the non-determinism of output flow and improves throughput under high demand
by dynamically assigning nodes to processors by proximity as well as availability.


3. Determinism of Output Flow Rate 

The AN/UYS-2 implementation of RC scheduling produces output flow with a
determinism that is dependent on the application graph’s execution profile. If the execution
graphs are inherently non-deterministic due to branching, recursion, and the like, then the
system’s output flow reflects it. The SSIMD array can handle recursion smoothly by having one
iteration of the recursion running on a processor fed into the next processor for another
execution. This is not possible on the AN/UYS-2, because the implied communications overhead
burdens the data bus to the point where throughput is seriously degraded.

Input injection rate control is the method that ATAMM uses to induce regularity
in its output flow. This approach can improve the regularity of the AN/UYS-2 but the
arbitrary loss of data is unacceptable. There may be ways to implement this type of approach
without specifically controlling the injection rate. One possibility is for the system to keep
current input on hand in a read buffer. As the input changes, the value of the buffer changes, but
there is always current data on hand for the start of a new graph instance.
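
The read buffer idea can be illustrated with a single-slot buffer. This is a minimal sketch, assuming one writer that delivers input and any number of graph instances that read it. The class name `LatestValueBuffer` and its methods are hypothetical and do not describe an AN/UYS-2 facility.

```python
import threading


class LatestValueBuffer:
    """Single-slot read buffer that always holds the most recent input."""

    def __init__(self, initial=None):
        self._lock = threading.Lock()
        self._value = initial

    def write(self, value):
        # Each new input overwrites the previous one, so the buffer
        # tracks the input as it changes.
        with self._lock:
            self._value = value

    def read(self):
        # Reading does not consume the value, so a new graph instance can
        # always start from the current input.
        with self._lock:
            return self._value


if __name__ == "__main__":
    buf = LatestValueBuffer(initial=0)
    buf.write(10)
    buf.write(20)
    print(buf.read())  # the most recently written input
```

A graph instance calls `read()` whenever it is ready to start, so its start time is decoupled from the arrival time of the input.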


B. Summary 

RC scheduling addresses the determinism of the response time of a data flow machine. 
Other research in the field of data-flow machines used in real-time environments, with the
exception of Old Dominion, notes the importance of such determinism and then either
ignores the problem or uses statistical profiles of an application to build in a response
cushion.

There are trade-offs in the approach insofar as deterministic execution profiles are
required to produce deterministic output flows. More deterministic performance can be
obtained from fully static scheduling policies, but the RC approach offers a hybrid that
combines the flexibility and robustness of dynamic data-flow with some of the determinism and
throughput performance of control-flow execution of each node.


C. Possible Improvements to the AN/UYS-2

The set-up, execution, and breakdown of nodes are among the larger overheads in the
implementation of the RC schedule. Lee [Lee 90] addresses the concept of hardware 
implementation of functions normally performed through software. The main advantage of 
the approach is that each fetch, set-up, breakdown, and write is much faster if performed in 
hardware. This also enables processors to access shared memory the same way they access 
local memory. 

The addition of nearest-neighbor communication paths between the APs might allow
more deterministic flow without the high overheads of a fully synchronous
implementation. This parallels some of the ideas of the Lincoln Labs machine without major
alterations to the system hardware.


D. Future Research 

The similarities of the Lincoln and Old Dominion machines to the AN/UYS-2 indicate 
that performance and throughput determinism gains are most easily found by mixing the 
balance of static and dynamic node scheduling. The RC technique is extremely good at
wrenching deterministic output flow from an existing architecture without expensive 
modifications. These other approaches suggest that some gains can come from hardware 
changes and some few from software. 

The investigation of interprocessor communications and the modification of data
arrival rates are two promising avenues for further improvement of the AN/UYS-2 and the
RC technique. Each of these is implementable at low cost and has the potential to
increase the system’s performance.

Another avenue of investigation is the CAPS system’s use of multiple rate execution
times. This capability can add flexibility to the range of applications the AN/UYS-2 can
handle and increase the life-span of the system for years to come.